Life, Death and Preferential Attachment 
Scientific communities are characterized by strong stratification. The highly skewed frequency
distribution of citations of published scientific papers suggests a relatively small number of active,
cited papers embedded in a sea of inactive and uncited papers. We propose an analytically soluble 
model which allows for the death of nodes. This model provides an excellent description of the 
citation distributions for live and dead papers in the SPIRES database. Further, this model suggests 
a novel and general mechanism for the generation of power law distributions in networks whenever 
the fraction of active nodes is small. 

PACS numbers: 89.65.-s, 89.75.-k 
That progress in science is driven by a few great contributions
becomes disturbingly clear when one considers
citation statistics. The vast majority of scientific papers
are either completely unnoticed or minimally cited. In high
energy physics, 4% of all papers account for 50% of the
citations, while 29% of all papers are not cited at all [1].

In a pioneering sociological work analyzing American 
high energy physicists, Cole and Cole [2] connect this
high degree of stratification in the scientific literature to 
what they call cumulative advantage. The concept un- 
derlying cumulative advantage was originally introduced 
by R. K. Merton [3] with the more striking name of the
'Matthew Effect'. Merton's simple observation was that 
success seems to breed success. A paper which has been 
cited many times is more likely to be cited again than 
one which is less cited, since "unto every one that hath 
shall be given, and he shall have abundance: but from 
him that hath not shall be taken away even that which 
he hath" [4], hence the name.

Inspired by Refs. [2, 3] and his own work on citation
networks, de Solla Price recast Simon's [6] ideas on the
mathematics leading to the power law distributions found
in nature and society into the first mathematical model
of a scale-free network [7]. Much later, the principles underlying
Price's model were independently re-discovered
by Barabási and Albert [8], who coined yet another name
for the same effect, namely preferential attachment. Preferential
attachment has since become a widely accepted
explanation of the power law degree distributions in com- 
plex networks in general. The strength of the preferential 
attachment model in either incarnation is its simplicity, 
but this can also be its weakness. In particular, such 
models tend to assume that networks are homogeneous. 
When real world networks can be shown to have identifiable
and significant inhomogeneities, preferential attachment
must be supplemented by appropriate additional
ingredients.

For example, it is an empirical fact that the vast major- 
ity of nodes in citation networks "die" after a relatively 
short time and are never cited again. A relatively small 
population of papers remains alive and continues to accumulate
citations many years after publication; this is the
main conclusion in Ref. [1]. The distinction between live
and dead populations represents an important inhomo- 
geneity in the citation data that is not considered in the 
simple preferential attachment model. We do not suggest 
that the presence of death in citation networks diminishes 
the importance of preferential attachment; however, the
distinctly different citation distributions observed for live 
and dead papers compel us to include the effects of the 
death of papers in our modeling efforts. It is the purpose 
of this paper to suggest one such extension of preferential 
attachment models. 



DEAD PAPERS 

The work in this paper is based on data obtained from 
the SPIRES database of papers in high energy physics. 
To be specific, the data used below is the network of 
all citable papers from the Theory subfield of SPIRES, 
as of the end of October 2003. Filtering out all papers for which
no information of publication time is available, we are 
left with a network of 275 665 nodes (i.e., papers). All 
citations to papers not in this network were removed, 
resulting in 3 434 175 edges (i.e., citations).
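
As a quick consistency check on these counts, the sketch below filters out citations that point outside the network and computes the mean number of citations per paper. The function names and the toy edge list are illustrative assumptions and not part of any SPIRES tooling. With the node and edge counts quoted above, the mean comes out at about 12.46 citations per paper.

```python
def in_network_citations(papers, citations):
    """Keep only citations whose citing and cited papers are both in the network.

    papers: iterable of paper identifiers (hypothetical IDs).
    citations: iterable of (citing, cited) pairs.
    """
    known = set(papers)
    return [(src, dst) for src, dst in citations if src in known and dst in known]


def mean_citations(n_nodes, n_edges):
    """Every in-network edge is one reference made and one citation received."""
    return n_edges / n_nodes


# Toy example: paper "X" is not in the network, so the edge ("C", "X") is dropped.
toy_papers = ["A", "B", "C"]
toy_citations = [("B", "A"), ("C", "A"), ("C", "B"), ("C", "X")]
toy_edges = in_network_citations(toy_papers, toy_citations)

# Node and edge counts quoted in the text for the SPIRES Theory subfield.
spires_mean = mean_citations(275_665, 3_434_175)
print(len(toy_edges), round(spires_mean, 2))
```

Because every retained edge is both a reference and a citation, the same number is also the mean number of in-network references per paper.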

Clearly, there is a variety of ways to define what is 
meant by a dead node in real data [13]. We have tested
several definitions, and our results are qualitatively in- 
dependent of the specifics of the definition. We have 
chosen to define papers that have not been cited in 2003 
to be dead. Having identified a population of dead papers,
we have determined the citation distributions for
live and dead papers. These distributions are shown in
Figure 2(a) and indicate that the two distributions are
significantly different. As suggested in the introduction,
most (i.e., approximately three-quarters) of the papers
in SPIRES are dead. It is also a simple matter to determine
the empirical ratio of live to dead papers as a
function of the number of citations per paper k. Figure 1
displays this ratio in the range 1 < k < 150. Over most
of this range the data is described by a straight line. We
note that the data for dead papers with high k-values is
very sparse. Since only 0.15% of dead papers have more
than 100 citations, statistics beyond this point are highly
unreliable. Thus, plotting the ratio of live to dead papers
gives a pessimistic representation of the data. The
ratio of dead to live papers is described satisfactorily by
the simple form b/(k + 1) for all but the highest values
of k, where this form overestimates the number of
dead papers by a factor of two to three. In short, Figure 1
implies that, to a fairly good approximation, the
ratio of dead to live papers with k citations is proportional
to 1/(k + 1). We will make use of this fact in the next
section to suggest an extension of the preferential attachment
model which includes the effects of death.

FIG. 1: The ratio of live to dead papers. The solid straight line
has been inserted to illustrate the linear relationship between
the live and dead populations for low values of k. The error
bars are calculated from the square roots of the citation counts.
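
A minimal sketch of how such a ratio and its error bars might be computed from binned counts is given below. The counts are invented for illustration, and the error propagation assumes independent Poisson fluctuations in the live and dead counts; this is one reading of "square roots of the counts" and not a procedure stated in the text.

```python
import math


def live_dead_ratio(n_live, n_dead):
    """Return (ratio, error) for the live-to-dead ratio at one value of k.

    Assumes independent Poisson counts, so the relative errors
    1/sqrt(n_live) and 1/sqrt(n_dead) add in quadrature.
    """
    ratio = n_live / n_dead
    error = ratio * math.sqrt(1.0 / n_live + 1.0 / n_dead)
    return ratio, error


# Hypothetical counts built so that dead/live = b / (k + 1) with b = 12.
b = 12.0
counts = {k: (100 * (k + 1), int(100 * b)) for k in (1, 5, 11, 23)}

for k, (n_live, n_dead) in counts.items():
    ratio, error = live_dead_ratio(n_live, n_dead)
    # For this synthetic input the ratio equals (k + 1) / b, a straight line in k.
    print(k, round(ratio, 3), round(error, 3), round((k + 1) / b, 3))
```

For real data the relative error grows quickly at large k, where the dead counts become small, which is the sparseness noted above.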

MODELING DEATH AND PREFERENTIAL ATTACHMENT

Following the usual structure of preferential attachment
models, we imagine that at every update a new paper
makes m references to papers already in the network
and then enters the network with $k = 0$ real citations
and $k_0 = 1$ "ghost" citations. Since we have chosen to
eliminate all references to papers not in SPIRES in constructing
our data set, there is an obvious and rigorous
sum rule that the average number of citations per paper is
also m. The probability that a paper in the network will
receive one of these references is assumed to be proportional
to its current total of real and ghost citations. We
can estimate when the effects of preferential attachment
become important by regarding $k_0$ as a free parameter.
Since we see no a priori reason why a paper with 2 citations
should have a significant advantage in acquiring
citations over a paper with 1 citation, we prefer to allow
the data to decide. Thus, in our model, the probability
that a paper with k citations acquires a new citation at
each time step is proportional to $k + k_0$ with $k_0 > 0$. We
can think of the displacement, $k_0$, as offering a way to
interpolate between full preferential attachment ($k_0 = 1$)
and no preferential attachment ($k_0 \to \infty$).

More importantly, at every update each live paper in 
the network has some probability of dying. Guided by 
the SPIRES data, we assume that this probability is proportional
to 1/(k + 1) for a paper with k real citations.
Once dead, a paper can no longer receive new citations.
In his 1976 paper, Price notes that cumulative advantage
is only half the Matthew Effect, because although
success is rewarded, there is no punishment for failure. In 
this sense, the model described here represents one imple- 
mentation of the full Matthew Effect. Since the rate at 
which papers are killed is inversely proportional to the 
number of citations which they have, low cited papers 
have a much higher probability of paying the ultimate 
penalty. 
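
The update rule described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the constant `death_scale`, the single seed paper, and the choice to collapse repeated citation targets are all hypothetical simplifications introduced here, and no attempt is made to reproduce the SPIRES parameters.

```python
import random


def simulate(n_papers, m=3, k0=1.0, death_scale=0.02, seed=0):
    """Toy simulation of preferential attachment with death.

    Each new paper cites up to m distinct live papers, chosen with probability
    proportional to k + k0, and then enters the network alive with k = 0.
    In the same update every previously live paper dies with probability
    death_scale / (k + 1). death_scale is a hypothetical constant.
    Returns (citations, alive): two lists indexed by paper.
    """
    rng = random.Random(seed)
    citations = [0]   # one seed paper so the first newcomer has something to cite
    alive = [True]
    for _ in range(n_papers - 1):
        live = [i for i, is_alive in enumerate(alive) if is_alive]
        targets = set()
        if live:
            weights = [citations[i] + k0 for i in live]
            # Sampling is with replacement; duplicates collapse in the set.
            targets = set(rng.choices(live, weights=weights, k=m))
        for i in targets:
            citations[i] += 1
        # Death step: low-cited papers are the most likely to die.
        for i in live:
            if rng.random() < death_scale / (citations[i] + 1):
                alive[i] = False
        citations.append(0)
        alive.append(True)
    return citations, alive


citations, alive = simulate(1000)
n_live = sum(alive)
print(n_live, len(alive) - n_live, sum(citations) / len(citations))
```

Dead papers are excluded from the candidate list, so they can never receive further citations, which is the defining property of death in the model.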

The rate equation approach introduced in the context 
of networks by Krapivsky, Redner, and Leyvraz [9] can
easily be modified to allow for death. We let $L_k$ be the
probability for finding a live paper with k citations and
$D_k$ be the probability of finding a dead paper with k citations.
Each paper cites m other papers in the database.
Papers are loaded into the database with in-degree $k = 0$.
We arrive at the following rate equations 

\begin{align}
L_k &= m\left(\lambda_{k-1} L_{k-1} - \lambda_k L_k\right) - \eta_k L_k + \delta_{k,0} \tag{1} \\
D_k &= \eta_k L_k, \tag{2}
\end{align}

where $\lambda_k$ and $\eta_k$ are rate constants. We define $L_k$ to
be equal to zero for $k < 0$, and since every paper has a
finite number of citations, the probabilities $L_k$ must become
exactly zero for sufficiently large k. Thus, we can
let all sums run from $k = 0$ to infinity. While the total
citation distribution is, of course, given by $L_k + D_k$, we
can also probe the live and dead distributions separately
both theoretically and empirically. For any choice of $\lambda_k$
and $\eta_k$ these equations trivially satisfy the normalization
condition on the total distribution. However, the constraint
that the mean number of references equals the
mean number of citations, $\sum_k k (L_k + D_k) = m$, must be
imposed by an overall scaling of the $\lambda_k$ and $\eta_k$. Eq. (2)
shows that the coefficients, $\eta_k$, are simply the ratio of
dead to live papers as a function of k. Given the empirical
values of this ratio shown in Figure 1, our model
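corresponds to the case where

\begin{equation}
m\lambda_k = a(k + k_0) \quad \text{and} \quad \eta_k = \frac{b}{k+1}. \tag{3}
\end{equation}

Performing the recursion, we find

\begin{equation}
L_k = \frac{\Gamma(k+2)\,\Gamma(k+k_0)}{a k_1 k_2\,\Gamma(k_0)} \, \frac{\Gamma(1-k_1)\,\Gamma(1-k_2)}{\Gamma(k-k_1+1)\,\Gamma(k-k_2+1)}, \tag{4}
\end{equation}

where $k_1$ and $k_2$ are the solutions to the quadratic equation

\begin{equation}
\left(a(k + k_0) + 1\right)(k + 1) + b = 0 \tag{5}
\end{equation}

regarded as a function of k.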
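
The recursion and its closed form can be cross-checked numerically. The sketch below iterates the steady-state rate equation with $m\lambda_k = a(k + k_0)$ and $\eta_k = b/(k+1)$ and compares the result with the gamma-function expression of Eq. (4). The function names are illustrative, the closed-form routine assumes that both roots of Eq. (5) are real (true for the parameter values used here), and the parameter values are the fitted ones quoted later in the text.

```python
import math


def roots_of_quadratic(a, b, k0):
    """Real roots k1, k2 of (a (k + k0) + 1) (k + 1) + b = 0, Eq. (5)."""
    # Expanded form: a k^2 + (a k0 + a + 1) k + (a k0 + 1 + b) = 0
    lin = a * k0 + a + 1.0
    const = a * k0 + 1.0 + b
    disc = lin * lin - 4.0 * a * const
    if disc < 0:
        raise ValueError("complex roots: this sketch only handles real k1, k2")
    s = math.sqrt(disc)
    return (-lin + s) / (2.0 * a), (-lin - s) / (2.0 * a)


def live_by_recursion(a, b, k0, k_max):
    """L_k for k = 0..k_max from the steady-state rate equation."""
    L = [1.0 / (1.0 + a * k0 + b)]
    for k in range(1, k_max + 1):
        gain = a * (k - 1 + k0)                   # m * lambda_{k-1}
        loss = 1.0 + a * (k + k0) + b / (k + 1)   # 1 + m * lambda_k + eta_k
        L.append(gain * L[-1] / loss)
    return L


def live_closed_form(a, b, k0, k):
    """L_k from the closed form of Eq. (4), evaluated through log-gamma."""
    k1, k2 = roots_of_quadratic(a, b, k0)
    log_L = (math.lgamma(k + 2) + math.lgamma(k + k0)
             + math.lgamma(1 - k1) + math.lgamma(1 - k2)
             - math.log(a * k1 * k2) - math.lgamma(k0)
             - math.lgamma(k - k1 + 1) - math.lgamma(k - k2 + 1))
    return math.exp(log_L)


a, b, k0 = 0.436, 12.4, 65.6
L = live_by_recursion(a, b, k0, 5000)
D = [b * L_k / (k + 1) for k, L_k in enumerate(L)]   # Eq. (2)
total = sum(L) + sum(D)
max_rel_diff = max(abs(live_closed_form(a, b, k0, k) / L[k] - 1.0)
                   for k in (0, 1, 10, 100, 1000))
print(total, sum(L), max_rel_diff)
```

The truncated total should be very close to 1, reflecting the normalization of the combined distribution, and `sum(L)` gives the model's live fraction for these parameters.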

One general observation of some interest emerges in the
limit $k_0 \to \infty$ in which preferential attachment is turned
off. We obtain this limit by making the replacement $a \to a/k_0$
in Eq. (4) and then taking the limit $k_0 \to \infty$ for fixed
a. A little work reveals that
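\begin{equation}
L_k = \frac{1}{a}\left(\frac{a}{a+1}\right)^{k+1} \frac{\Gamma(k+2)\,\Gamma\!\left(1+\frac{b}{a+1}\right)}{\Gamma\!\left(k+2+\frac{b}{a+1}\right)}. \tag{6}
\end{equation}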



The $D_k$ are simply $b L_k/(k + 1)$ as before. (Eq. (6) can
also be obtained by solving Eqs. (1) and (2) with constant
$\lambda_k$ and $\eta_k = b/(k + 1)$; the two approaches are
equivalent.) When the death mechanism is eliminated
by setting $b = 0$, the resulting distribution shows an
exponential decrease, which is to be expected given the
assumed absence of preferential attachment.

In fact, the death of nodes offers an alternative mechanism
for obtaining power laws. To see this, consider the
limit $a \to \infty$ and $b \to \infty$ with the ratio $r = b/(a + 1) \approx b/a$
fixed. In this limit it is tempting to replace the term
$a/(1 + a)$ by 1, which allows us to compute simple expressions
for the fraction of dead papers $f$ and the average
number of citations of the live and dead papers, $m_L$ and
$m_D$. (This approximation is appropriate when $r > 2$.
When $r < 2$ the neglected factor is essential for ensuring
the convergence of $f$ and/or $m_L$.) The fraction of live
papers is then
\begin{equation}
1 - f = \frac{1}{a(r - 1)}, \tag{7}
\end{equation}



and the average number of citations for the live papers 
and dead papers, respectively, is 



\begin{equation}
m_L = \frac{2}{r - 2} \quad \text{and} \quad m_D = \frac{1}{r - 1}. \tag{8}
\end{equation}



The average number of citations for all papers is evidently
$m_D$ in the limit $a \to \infty$ for which $f \to 1$. Most
importantly, we see in this limit that
\begin{equation}
L_k \sim \frac{1}{k^{r}} \quad \text{and} \quad D_k \sim \frac{1}{k^{r+1}} \tag{9}
\end{equation}



for $k \gg r$. Thus, we see that power-law distributions for
both live and dead papers emerge naturally in the limit
where the fraction of dead papers $f$ goes to 1. In this
limit, a vanishing fraction of live papers swim in a sea 
of dead papers. Since such power laws are sometimes 
regarded as an indication of preferential attachment, it 
is useful to see a quite different way of obtaining them. 
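
The emergence of power laws without preferential attachment can be checked numerically. The sketch below solves the rate equations with a constant attachment rate $m\lambda_k = a$ and death rate $\eta_k = b/(k+1)$, using large a and b at fixed $r = b/(a+1)$. It then compares the live fraction, the mean citation counts, and the log-log slopes with the limiting expressions $1/(a(r-1))$, $2/(r-2)$, $1/(r-1)$, and the exponents $-r$ and $-(r+1)$ that follow from the rate equations under the approximation described above. The parameter values and function names are arbitrary illustrative choices.

```python
import math


def constant_rate_distributions(a, b, k_max):
    """L_k and D_k for m * lambda_k = a (no preferential attachment)
    and eta_k = b / (k + 1), from the steady-state rate equations."""
    L = [1.0 / (1.0 + a + b)]
    for k in range(1, k_max + 1):
        L.append(a * L[-1] / (1.0 + a + b / (k + 1)))
    D = [b * L_k / (k + 1) for k, L_k in enumerate(L)]
    return L, D


def log_slope(values, k_lo, k_hi):
    """Average slope of log(values[k]) against log(k) between k_lo and k_hi."""
    return (math.log(values[k_hi]) - math.log(values[k_lo])) / math.log(k_hi / k_lo)


r = 3.0
a = 1.0e4
b = r * (a + 1.0)
L, D = constant_rate_distributions(a, b, 200_000)

live_fraction = sum(L)
m_live = sum(k * x for k, x in enumerate(L)) / live_fraction
m_dead = sum(k * x for k, x in enumerate(D)) / sum(D)

print(live_fraction, 1.0 / (a * (r - 1.0)))   # live fraction vs. limiting form
print(m_live, 2.0 / (r - 2.0))                # mean citations, live papers
print(m_dead, 1.0 / (r - 1.0))                # mean citations, dead papers
print(log_slope(L, 100, 200), -r)             # live power-law exponent
print(log_slope(D, 100, 200), -(r + 1.0))     # dead power-law exponent
```

The measured slopes approach the limiting exponents only for k well above r, and the factor $(a/(a+1))^{k}$ eventually cuts the power law off at k of order a.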



DEATH IN THE REAL WORLD 

We now return to the full model and compare it to 
the data from SPIRES. If we assign all zero cited papers 
to the dead category, the mean number of citations is 
34.1 for live papers, 4.5 for dead papers, and 12.5 for 
all papers. The fraction of live papers is 27.0%. By 
minimizing the squared fractional error, we can fit the 
live data with an rms error of only 21% using the forms 
of Eqs. (3) and (4) with the parameters $k_0 = 65.6$, $a = 0.436$,
and $b = 12.4$. Given that the data span six orders
of magnitude, the quality of this agreement is strikingly
high. The results of the fits are displayed in Figure 2.
FIG. 2: (a) Log-log plots of the distributions for live and dead
papers. The triangles are the live data and the squares are the 
dead data. The solid lines are the fit. (b) A log-log plot of the 
distribution of all papers (live plus dead). The points are the 
data; the solid lines are the fit. 
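
The fit criterion mentioned above, the rms fractional error, can be sketched as follows. The precise definition is an assumption made here for illustration: the deviation is taken relative to the data in each bin, and bins with no data are skipped. The numbers are invented and are not SPIRES values.

```python
import math


def rms_fractional_error(model, data):
    """Root-mean-square of (model - data) / data over bins with data > 0."""
    terms = [((m - d) / d) ** 2 for m, d in zip(model, data) if d > 0]
    return math.sqrt(sum(terms) / len(terms))


# Hypothetical binned counts and model predictions.
data = [100.0, 50.0, 10.0]
model = [110.0, 45.0, 10.0]
print(rms_fractional_error(model, data))
```

Because each bin contributes its relative rather than absolute deviation, sparsely populated bins in the tail weigh as much as the heavily populated bins at small k.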
The fitted mean number of citations is 32.9
for live papers, 4.25 for dead papers, and 12.8 for all
papers. According to the fit, 7.5% of all papers with
zero citations are, in fact, alive. Assigning this fraction of
zero citation papers to the live data, we find mean citations
of 31.5, 4.6, and 12.5 respectively. We also find
that 29.2% of the papers in the model are live. This is in
excellent agreement with the data. There is remarkably
little strain in the fit. We can, for example, determine the
model parameters a, b, and $k_0$ from the empirical values
of $m_L$, $m_D$, and $f$. This leads to small changes in the
model parameters and yields a description of comparable
quality for the distributions. It is clear from Figure 2 that
the present fit to the live distribution leads to some systematic
errors in the description of the dead population
for the highest values of k. Given the deviations from
a straight line of the data of Figure 1 for large k, this
comes as no surprise. This could obviously be remedied
by a small modification of the $\eta_k$ through the inclusion
of a suitable term in the denominator.

It is clear that the present simple model is capable 
of fitting the distributions of both live and dead papers 
with remarkable accuracy. We note that the best fit value
of the parameter $k_0 = 65.6$ suggests that a paper with
$k = 66$ citations has a competitive advantage over a paper
with no citations of a factor of 2 rather than the factor 
of 67 suggested by the simplest preferential attachment 
models. 
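
The two factors quoted above follow directly from the attachment weight $k + k_0$. A short sketch of the arithmetic (the function name is illustrative):

```python
def attachment_advantage(k, k0, k_ref=0):
    """Ratio of attachment weights (k + k0) / (k_ref + k0)."""
    return (k + k0) / (k_ref + k0)


print(attachment_advantage(66, 65.6))  # about 2.01
print(attachment_advantage(66, 1.0))   # 67.0
```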



DISCUSSION AND CONCLUSIONS 

It is obvious that the death mechanism introduced here 
is essential if we wish to consider the empirical citation 
distributions of live and dead papers separately. It is less 
obvious that the death mechanism (i.e., $b \neq 0$) is required
to provide a good description of the total citation data. A
similar fit to the citation distribution for all papers with
the constraint $b = 0$ yields the parameters $a = 0.528$ and
$k_0 = 13.22$ and gives an rms fractional error of 33.6%.
Although there are some indications of systematic devia- 
tions in the resulting fit, its overall quality remains high 
in spite of the fact that this constrained fit ignores impor- 
tant correlations present in the data set. This result illus- 
trates the familiar fact that more detailed modeling is not 
necessarily required to fit global network distributions 
even if important empirical correlations are neglected in 
the process. It also reminds us of the equally familiar 
corollary that even a high quality fit to global network 
distributions cannot safely be regarded as an indication 
of the absence of additional correlations in the data. The 
most significant difference between the model parameters 
obtained with and without the death mechanism is the 
value of $k_0$, which changes by a factor of 5 from 65.6
to 13.2. We have an intuitive preference for the larger 
value. (We believe that preferential attachment will play
an important role when a paper is sufficiently visible that
authors feel entitled to cite it without reading it and that
$k_0 \approx 65$ represents a reasonable threshold of visibility.) It
is clear, independent of such subjective preferences, that 
it is dangerous to assign physical significance to even the 
most physically motivated parameters if a network con- 
tains unidentified correlations or if known correlations 
are neglected in the modeling process. Specifically, it is 
difficult to draw firm conclusions regarding the onset of 
preferential attachment if the death mechanism is not 
included. 

We have identified significant differences between the 
citation distributions of live and dead papers in the 
SPIRES data, and we have constructed a model including 
both modified preferential attachment and the death of 
nodes that is quantitatively successful in describing these 
differences. We have further seen that the death mecha- 
nism can provide an alternate mechanism for producing 
power law distributions when the fraction of live nodes 
is small. Since many networks involve a small fraction 
of active nodes, this mechanism may be of more general 
utility. However, the numerical success of the present 
model does not indicate the absence of additional corre- 
lations in the SPIRES data. In fact, we know that such 
correlations exist. Consider the conditional probability, 
$P(k|\bar{m})$, that a paper written by an author with a lifetime
average of $\bar{m}$ citations per paper will receive k citations.
The general interest in citation data is based on the widespread
intuitive belief that $P(k|\bar{m})$ is a sensitive function
of $\bar{m}$. This belief is supported by the SPIRES data and
will be treated in a subsequent publication. 

Our grateful thanks to Travis C. Brooks at SPIRES 
without whose swift replies and thoughtful help we would 
have lacked all of the data! 